
Add scotch-parser semantic comparison and steal plan #3

Open
GordonLeong wants to merge 1 commit into main from codex/compare-semantic-tree-building-with-sec-parser

@GordonLeong
Owner

Motivation

  • Provide an implementation-focused comparison between edgartools 5.19 and the analyzed sec-parser heuristics to decide which deterministic rules to reimplement for scotch-parser v0.1.
  • Prioritize small, testable enrichment passes (IDs, sentence splitting, table context, footnotes) that run after edgartools parsing and avoid rewriting sanitize/stamp unless strictly needed.
  • Produce a concrete integration plan, minimal stamping deltas, and a test/validation blueprint so downstream reader and card pipelines get stable provenance and sidecar output.

Description

  • Add a planning document at docs/internal/planning/scotch-parser-semantic-comparison.md that contains an executive comparison, categorized gap & steal matrix, prioritized top-5/next-5 steals, minimal sanitize/stamp delta guidance, a packages/scotch-parser/ folder sketch, pass sequencing, and a test plan.
  • The matrix maps each heuristic from the sec-parser analysis to edgartools status (covered/partial/not) and gives a steal decision with proposed pass name, inputs, deterministic rules, outputs, risks, and minimal tests for each YES.
  • Recommends v0.1 top steals (assign_semantic_ids_pass, sentence_split_pass, table_context_linking_pass, table_type_refinement_pass, footnote_linking_pass) and minimal, opt-in stamping deltas (--stamp-table-cells, --stamp-footnote-refs) only if required.
  • Specifies how to call edgar.documents.parse_html as the base engine, enrichment pass ordering, sidecar schema expectations, and validation metrics for deterministic quality checks.
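The pass sequencing described above can be illustrated with a minimal sketch. It is not the code from the planning document: the `Node` shape, the `run_passes` helper, the ID format, and the sentence-splitting regex are all hypothetical, and the input list stands in for whatever `edgar.documents.parse_html` actually returns (its output structure is not assumed here). Only the pass names `assign_semantic_ids_pass` and `sentence_split_pass` come from the PR description.

```python
import hashlib
import re
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a parsed document node (not an edgartools type)."""
    kind: str                      # e.g. "paragraph" or "table"
    text: str
    path: tuple[int, ...]          # position of the node in the parsed tree
    semantic_id: str = ""
    sentences: tuple[str, ...] = ()


# A pass is a pure function from a node list to a new node list.
Pass = Callable[[list[Node]], list[Node]]


def assign_semantic_ids_pass(nodes: list[Node]) -> list[Node]:
    """Derive a stable ID from kind, tree path, and text (assumed ID scheme)."""
    out = []
    for node in nodes:
        key = f"{node.kind}|{'.'.join(map(str, node.path))}|{node.text}"
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
        out.append(replace(node, semantic_id=f"{node.kind}-{digest}"))
    return out


# Naive rule: split after ., ! or ? when followed by whitespace and a capital.
# It will mis-split abbreviations such as "U.S. Treasury"; a real pass needs more rules.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentence_split_pass(nodes: list[Node]) -> list[Node]:
    """Attach sentence spans to paragraph nodes; leave other kinds untouched."""
    out = []
    for node in nodes:
        if node.kind == "paragraph":
            parts = tuple(s.strip() for s in _SENTENCE_BOUNDARY.split(node.text) if s.strip())
            out.append(replace(node, sentences=parts))
        else:
            out.append(node)
    return out


def run_passes(nodes: list[Node], passes: list[Pass]) -> list[Node]:
    """Apply enrichment passes in a fixed order so output is reproducible."""
    for enrichment_pass in passes:
        nodes = enrichment_pass(nodes)
    return nodes


if __name__ == "__main__":
    parsed = [
        Node("paragraph", "Revenue rose. Costs fell.", (0,)),
        Node("table", "Segment | 2023 | 2022", (1,)),
    ]
    for node in run_passes(parsed, [assign_semantic_ids_pass, sentence_split_pass]):
        print(node.semantic_id, node.sentences)
```

Keeping each pass a pure function over immutable nodes is one way to make the passes individually testable and their ordering explicit, which matches the "small, testable enrichment passes" goal stated in the Motivation.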

Testing

  • This change is documentation-only and does not modify parser runtime code or unit tests.
  • No automated unit tests were added or executed as part of this PR because the change is a planning/spec document.
  • The new file presence and contents were validated locally by listing and printing the file to ensure it contains the planned comparison, matrix, implementation plan, and test plan.
